The data for this analysis can be found here.
The goal of the analysis is to model the growth of the fish as a function of age and water temperature.
The fish are kept in tanks at 25, 27, 29 and 31 degrees Celsius.
After birth, a test specimen is chosen at random every 14 days and its length measured. There are 44 rows of data.
The first rows of the data look like this:
| age | temperature | length |
|---|---|---|
| 14 | 25 | 620 |
| 28 | 25 | 1315 |
| 41 | 25 | 2120 |
| 55 | 25 | 2600 |
| 69 | 25 | 3110 |
| 83 | 25 | 3535 |
From the graphs we can already see something interesting.
The growth function of the fish seems well behaved, so hopefully we can estimate it properly.
The temperature seems to have a clear effect on our fish: it does little up to a critical value, but at 31 degrees Celsius it has an evident impact, reducing the length of the fish.
In a frequentist framework, the dumbest thing we can do is throw a linear regression at everything and see what happens.
From now on we will write z for the length, x for the age and y for the temperature.
This model is given by:
\[z= ax + by +c\]
z = df$length
x = df$age
y = df$temperature
# Compute the linear regression (z = ax + by + c)
fit <- lm(z ~ x + y)
summary(fit)
##
## Call:
## lm(formula = z ~ x + y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1406.27 -398.59 30.73 313.94 1291.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3904.266 1149.044 3.398 0.00152 **
## x 26.241 2.055 12.769 7.11e-16 ***
## y -106.414 40.452 -2.631 0.01195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 600 on 41 degrees of freedom
## Multiple R-squared: 0.8056, Adjusted R-squared: 0.7962
## F-statistic: 84.98 on 2 and 41 DF, p-value: 2.607e-15
This model gives us a multiple R-squared of \(80\)%: not bad, but we can surely do better.
Let’s take a look at the fitted plane:
As we could expect, the simple flat plane is not capable of recovering the curvature of the function; moreover, this model cannot capture the influence of the temperature.
As we can read in this study by Johannes Hamre, Espen Johnsen and Kristin Hamre, the growth function of many fish species can be estimated with a second-order polynomial. So let's try to fit one (for the moment leaving out the temperature factor):
\[z = a + bx +cx^2\]
##
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1099.77 -81.28 75.85 320.95 741.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -364.91117 255.59918 -1.428 0.161
## poly(df$age, 2, raw = T)1 69.04526 7.05444 9.787 2.75e-12 ***
## poly(df$age, 2, raw = T)2 -0.25642 0.04117 -6.229 2.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 465 on 41 degrees of freedom
## Multiple R-squared: 0.8833, Adjusted R-squared: 0.8776
## F-statistic: 155.1 on 2 and 41 DF, p-value: < 2.2e-16
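The post does not show the call that produced this summary, but it is the standard `lm` pattern with a raw polynomial basis. A self-contained sketch on just the six 25-degree rows from the data table above (so the coefficients will differ from the full 44-row fit reported here):

```r
# Second-order polynomial fit, length = a + b*age + c*age^2,
# using only the six 25-degree rows shown in the data table
df25 <- data.frame(age    = c(14, 28, 41, 55, 69, 83),
                   length = c(620, 1315, 2120, 2600, 3110, 3535))
fit2 <- lm(length ~ poly(age, 2, raw = TRUE), data = df25)
coef(fit2)
```

Note that `raw = TRUE` uses plain powers of age, so the coefficients map directly onto \(a\), \(b\) and \(c\); the default orthogonal basis would not.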
We reach an R-squared of \(88\)%, so the second-degree polynomial is a good way to estimate our function.
Obviously our model is biased toward the cooler regime, since three of the tanks sit below the critical temperature and only one sits above it.
To bring temperature into our model, we can start by fitting this polynomial regression separately on the two groups and analyzing the results.
df = df_clean
df = df[1:33, ]    # rows 1-33: the three cooler tanks (df_clean sorted by temperature)
df = df[order(df$age), ]
cold_model <- lm(df$length ~ poly(df$age, 2, raw = T), data = df)
predicted.intervals <- predict(cold_model, data.frame(x = df$age),
                               interval = 'confidence', level = 0.99)
cold_predict <- function(x) coef(cold_model)[1] + coef(cold_model)[2] * x + coef(cold_model)[3] * x^2

df = df_clean
df = df[34:44, ]   # rows 34-44: the 31 degree tank
df = df[order(df$age), ]
hot_model <- lm(df$length ~ poly(df$age, 2, raw = T), data = df)
predicted.intervals <- predict(hot_model, data.frame(x = df$age),
                               interval = 'confidence', level = 0.99)
hot_predict <- function(x) coef(hot_model)[1] + coef(hot_model)[2] * x + coef(hot_model)[3] * x^2
##
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1099.77 -81.28 75.85 320.95 741.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -364.91117 255.59918 -1.428 0.161
## poly(df$age, 2, raw = T)1 69.04526 7.05444 9.787 2.75e-12 ***
## poly(df$age, 2, raw = T)2 -0.25642 0.04117 -6.229 2.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 465 on 41 degrees of freedom
## Multiple R-squared: 0.8833, Adjusted R-squared: 0.8776
## F-statistic: 155.1 on 2 and 41 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.62 -69.91 -59.86 108.31 153.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43.50419 137.33227 -0.317 0.76
## poly(df$age, 2, raw = T)1 52.56762 3.79032 13.869 7.07e-07 ***
## poly(df$age, 2, raw = T)2 -0.20858 0.02212 -9.430 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124.9 on 8 degrees of freedom
## Multiple R-squared: 0.9848, Adjusted R-squared: 0.981
## F-statistic: 259.6 on 2 and 8 DF, p-value: 5.303e-08
##
## Call:
## lm(formula = df$length ~ poly(df$age, 2, raw = T), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303.26 -64.45 1.38 74.18 545.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -472.04682 107.12264 -4.407 0.000124 ***
## poly(df$age, 2, raw = T)1 74.53781 2.95654 25.211 < 2e-16 ***
## poly(df$age, 2, raw = T)2 -0.27237 0.01725 -15.786 4.48e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 168.8 on 30 degrees of freedom
## Multiple R-squared: 0.9862, Adjusted R-squared: 0.9853
## F-statistic: 1073 on 2 and 30 DF, p-value: < 2.2e-16
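Incidentally, the hand-built `cold_predict` / `hot_predict` helpers replicate what `predict()` already does; with a formula written on column names (rather than `df$` references) the two agree. A sketch on the six 25-degree rows from the data table:

```r
# Hand-built polynomial predictor vs predict(): the two should coincide
df25 <- data.frame(age    = c(14, 28, 41, 55, 69, 83),
                   length = c(620, 1315, 2120, 2600, 3110, 3535))
m <- lm(length ~ poly(age, 2, raw = TRUE), data = df25)
manual <- function(x) coef(m)[1] + coef(m)[2] * x + coef(m)[3] * x^2
new_ages <- c(20, 60, 100)
all.equal(unname(manual(new_ages)),
          unname(predict(m, newdata = data.frame(age = new_ages))))  # TRUE
```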
Fixing the temperature (or rather, stating whether the temperature is under or over the critical level), our R-squared jumps to \(99\)%: the function becomes really stable.
| | model.coefficients | hot_model.coefficients | cold_model.coefficients |
|---|---|---|---|
| (Intercept) | -364.9111654 | -43.5041889 | -472.0468243 |
| poly(df$age, 2, raw = T)1 | 69.0452649 | 52.5676185 | 74.5378137 |
| poly(df$age, 2, raw = T)2 | -0.2564195 | -0.2085775 | -0.2723669 |
| model | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -881.1041862 | 151.2818554 |
| poly(df$age, 2, raw = T)1 | 54.7985339 | 83.2919959 |
| poly(df$age, 2, raw = T)2 | -0.3395608 | -0.1732782 |
| hot_model | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -360.1929602 | 273.1845824 |
| poly(df$age, 2, raw = T)1 | 43.8271289 | 61.3081081 |
| poly(df$age, 2, raw = T)2 | -0.2595854 | -0.1575696 |
| cold_model | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -690.8204320 | -253.2732166 |
| poly(df$age, 2, raw = T)1 | 68.4997454 | 80.5758821 |
| poly(df$age, 2, raw = T)2 | -0.3076039 | -0.2371298 |
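Those 2.5 % / 97.5 % tables are what `confint()` prints at its default 95 % level; a self-contained sketch, again on the six 25-degree rows from the data table:

```r
# 95% confidence intervals for the polynomial coefficients
df25 <- data.frame(age    = c(14, 28, 41, 55, 69, 83),
                   length = c(620, 1315, 2120, 2600, 3110, 3535))
m  <- lm(length ~ poly(age, 2, raw = TRUE), data = df25)
ci <- confint(m)   # default level = 0.95: columns '2.5 %' and '97.5 %'
ci
```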
Looking at the parameter estimates and confidence intervals, we gain further evidence that the two models are indeed different.
To build a model that incorporates the temperatures in a single function, let's go Bayesian.
Let's first construct the same simple model as before, a plain polynomial regression, and try to recover the parameters.
cat("model{
  #Likelihood
  for(i in 1:n){
    length[i] ~ dnorm(mi[i], tau)
    mi[i] <- a + b*x[i] + c*pow(x[i], 2)
  }
  #Priors (non-informative)
  a ~ dunif(-1000, 1000)
  b ~ dunif(-1000, 1000)
  c ~ dunif(-1000, -0.001)
  tau ~ dgamma(0.01, 0.01)
  sigma <- 1 / sqrt(tau)
}", file = 'fish_model_simple.txt')
dats = list(length = df$length, x = df$age, n = nrow(df))
inits = list(list(a = 300, b = 200, c = -1, tau = 1),
             list(a = -300, b = -200, c = -0.1, tau = 0.1))
params = c('a', 'b', 'c', 'sigma')
simple_model = jags(model.file = 'fish_model_simple.txt', data = dats, inits = inits,
                    n.chains = 2, parameters.to.save = params,
                    n.iter = niter, n.burnin = niter/50)
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 44
## Unobserved stochastic nodes: 4
## Total graph size: 146
##
## Initializing model
## Inference for Bugs model at "fish_model_simple.txt", fit using jags,
## 2 chains, each with 5e+05 iterations (first 10000 discarded), n.thin = 490
## n.sims = 2000 iterations saved
## mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
## a -358.5 249.4 -847.8 -525.7 -359.6 -188.6 131.6 1 410
## b 68.9 7.1 55.1 64.2 68.9 73.6 82.8 1 770
## c -0.3 0.0 -0.3 -0.3 -0.3 -0.2 -0.2 1 900
## deviance 666.4 2.9 662.8 664.4 665.7 667.7 673.4 1 1700
## sigma 472.7 52.9 381.6 435.8 467.3 505.7 579.6 1 2000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 4.2 and DIC = 670.6
## DIC is an estimate of expected predictive error (lower deviance is better).
To evaluate the goodness of the model we can use the DIC.
The deviance information criterion is a hierarchical modeling generalization of the Akaike information criterion. It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC. It is only valid when the posterior distribution is approximately multivariate normal.
How to calculate it?
Define the deviance as \({\displaystyle D(\theta )=-2\log(p(y|\theta ))+C\,}\), where \(y\) are the data, \(\theta\) are the unknown parameters of the model and \(p(y|\theta )\) is the likelihood function. \(C\) is a constant that cancels out in all calculations that compare different models, and which therefore does not need to be known.
To calculate the effective number of parameters of the model, as described in Gelman(2004, p. 182) we use \(p_{D}=p_{V}=\frac{1}{2} \hat{var}(D(\theta))\). The larger the effective number of parameters is, the easier it is for the model to fit the data, and so the deviance needs to be penalized.
The deviance information criterion is calculated as
\({\mathit {DIC}}=p_{D}+{\bar {D}}\), or equivalently as
\({\mathit {DIC}}=D({\bar {\theta }})+2p_{D}\).
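As a sanity check of the `pD = var(deviance)/2` rule, here is the computation on a synthetic deviance sample with the same posterior mean and spread that the simple model reports (666.4 and 2.9); the result should land close to the pD = 4.2 and DIC = 670.6 printed above:

```r
# pD = var(D)/2 and DIC = Dbar + pD, on a synthetic deviance sample
# matching the reported posterior mean (666.4) and sd (2.9)
set.seed(1)
dev_draws <- rnorm(2000, mean = 666.4, sd = 2.9)
pD  <- var(dev_draws) / 2        # effective number of parameters
DIC <- mean(dev_draws) + pD      # Dbar + pD
round(c(pD = pD, DIC = DIC), 1)
```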
We get a DIC of 670. For now it's just a raw number, but it will be useful to compare this model with the next one, where we introduce the temperature.
Let's take a look at the common MCMC diagnostics to assess the performance.
The density plots are approximately normal (as we would like them to be) and very similar between the two chains.
Pretty good: with such a long, heavily thinned run, we have almost uncorrelated chains.
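One quick way to quantify "almost uncorrelated" is the lag-1 autocorrelation of the saved draws (which, assuming the usual R2jags object layout, live in `simple_model$BUGSoutput$sims.array`). A sketch on a synthetic stand-in chain:

```r
# Lag-1 autocorrelation of a well-mixed chain should be near zero
set.seed(42)
chain <- rnorm(2000)                      # stand-in for one thinned chain
lag1  <- acf(chain, plot = FALSE)$acf[2]  # acf[1] is lag 0, acf[2] is lag 1
abs(lag1) < 0.1                           # TRUE for an uncorrelated chain
```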
The model seems capable of recovering realistic parameters; however, we still have the same problem as in the frequentist analysis: we want to encode the temperature in a single model. Let's try to do it.
To introduce the temperature, the proposal is to add two parameters that modify the coefficients of the polynomial when the temperature is too hot (using a transformation of the temperature feature into a binary indicator):
\[z = a + (b - \beta h)x + (c - \gamma h)x^2\]
cat("model{
  #Likelihood
  for(i in 1:n){
    length[i] ~ dnorm(mi[i], tau)
    mi[i] <- a + (b - beta*h[i])*x[i] + (c - gamma*h[i])*pow(x[i], 2)
  }
  #Priors (non-informative)
  a ~ dunif(-1000, 1000)
  b ~ dunif(-1000, 1000)
  c ~ dunif(-1000, -0.001)
  beta ~ dunif(-100, 100)
  gamma ~ dunif(-100, 100)
  tau ~ dgamma(0.001, 0.001)
  sigma <- 1 / sqrt(tau)
}", file = 'fish_model_complex.txt')
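The `h[i]` in the likelihood is the binary hot indicator. The post does not show how `df$hot` is built; one plausible construction, given that only the 31-degree tank shows the effect, is:

```r
# Hypothetical construction of the binary 'hot' feature:
# 1 for the 31-degree tank, 0 for the three cooler ones
toy <- data.frame(temperature = c(25, 27, 29, 31))
toy$hot <- as.integer(toy$temperature == 31)
toy$hot  # 0 0 0 1
```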
data = list(length = df$length, x = df$age, h = df$hot, n = nrow(df))
init = list(list(beta = 0.1, gamma = 0.1, a = -300, b = -200, c = -1, tau = 1),
            list(beta = 1, gamma = 1, a = 300, b = 200, c = -0.1, tau = 0.1))
params = c('beta', 'gamma', 'a', 'b', 'c', 'sigma')
complex_model = jags(model.file = 'fish_model_complex.txt', data = data, n.chains = 2,
                     parameters.to.save = params, n.iter = niter,
                     n.burnin = niter/50, inits = init)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 44
## Unobserved stochastic nodes: 6
## Total graph size: 234
##
## Initializing model
## Inference for Bugs model at "fish_model_complex.txt", fit using jags,
## 2 chains, each with 5e+05 iterations (first 10000 discarded), n.thin = 490
## n.sims = 2000 iterations saved
## mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
## a -369.1 93.5 -554.1 -430.0 -369.3 -311.3 -181.9 1 2000
## b 72.0 2.6 66.6 70.3 72.0 73.7 77.1 1 2000
## beta 11.2 2.6 6.0 9.4 11.3 12.9 16.0 1 2000
## c -0.3 0.0 -0.3 -0.3 -0.3 -0.2 -0.2 1 2000
## deviance 576.7 3.8 571.7 573.9 575.9 578.7 585.7 1 2000
## gamma 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 2000
## sigma 170.9 19.9 137.8 156.7 169.0 183.6 214.4 1 2000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 7.1 and DIC = 583.8
## DIC is an estimate of expected predictive error (lower deviance is better).
We get a lower DIC, so this model is better.
Again, the density plots are approximately normal and very similar between the two chains.
Again, thanks to the long, thinned run, the chains are almost uncorrelated.
After our analysis, we can state that, according to our data, temperature is a relevant factor in the growth of fish, and pushing this study further could help us understand how global warming might affect marine animal populations.
However, the sample in this work was too small to draw strong conclusions; for example, the fact that we have only one heavily warmed tank could be a source of bias.
To get more consistent results we should use larger samples.